Abstract

By restricting the possible values of the proportion of null hypotheses that are true,
the local false discovery rate (LFDR) can be estimated using as few as one comparison.
The proportion of proteins with equivalent abundance was estimated to be about 20%
for patient group I and about 90% for group II. The simultaneously estimated LFDRs
give approximately the same inferences as individual-protein confidence levels for group
I but are much closer to individual-protein LFDR estimates for group II. Simulations
confirm that confidence-based inference or LFDR-based inference performs markedly
better for low or high proportions of true null hypotheses, respectively.

Keywords: confidence distribution; empirical Bayes; Lindley's paradox; local false discovery 
rate; multiple comparison procedure; multiple testing; observed confidence level; restricted 

1 Introduction

In the development of statistical methods for interpreting high-dimensional genomics data, 
the challenges involved in analyzing genomics data sets of much smaller scale have been 
largely overlooked, and yet such data are routinely generated. Out of the thousands of genes 
in the human genome, the expression levels of only on the order of 30 genes are measured 
in a real-time polymerase chain reaction experiment. Among the hundreds of thousands of 
proteins in the human proteome, the abundance levels of only on the order of 200 proteins 
are measured with mass spectrometry. The following idealization of the candidate-gene 
approach to genetic association studies poses a problem encountered in analyzing data from 
a small fraction of a large number of biological features, with each feature corresponding to 
a different population in the sampling theory sense. 

Example 1. Consider $10^6$ populations such that $X_i \sim N(\mu_i, 1)$ for $i = 1, \dots, 10^6$, where
$\mu_i = 2$ for $N_1$ values of $i$ and $\mu_i = 0$ for $10^6 - N_1$ values of $i$. None of the random values is
observed except $x_1$, the realization of $X_1$. The null hypothesis of interest is $\mu_1 = 0$. Let $\Phi$
and $\phi$ respectively denote the standard normal distribution function and density function.
Without any knowledge of $N_1$, few would question the applicability of the p-value $1 - \Phi(x_1)$.
On the other hand, in the absence of other information, the use of $P(\mu_1 = 0; N_1) = 1 - N_1/10^6$
as an approximate, nonsubjective prior probability of the null hypothesis in order to obtain
the approximate posterior probability

$$P(\mu_1 = 0 \mid x_1; N_1) = \frac{(1 - N_1/10^6)\,\phi(x_1)}{(1 - N_1/10^6)\,\phi(x_1) + (N_1/10^6)\,\phi(x_1 - 2)} \qquad (1)$$

would not be controversial if $N_1$ were known. Suppose that $N_1$ is unknown but can be safely
assumed to be between 1 and 100. Then, for at least 99.99% of the populations, the null
hypothesis is true and thus $1 - \Phi(X_i) \sim U(0, 1)$. By contrast, for those same populations,
$P(\mu_i = 0 \mid X_i; N_1) \approx 1$ with high probability regardless of the value between 1 and 100
that is guessed for $N_1$ in computing the posterior probability. For instance, if $x_1 = 2$, then
the p-value is $1 - \Phi(2) = 2.28\%$ even though the posterior probability of the null hypothesis is
at least $P(\mu_1 = 0 \mid 2; 100) = 99.93\%$ and possibly as high as $P(\mu_1 = 0 \mid 2; 1) = 1 - 7.39 \times 10^{-6}$.
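
The figures quoted in Example 1 can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names are hypothetical, and the standard normal distribution function is obtained from the error function of the Python standard library.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x: float) -> float:
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def one_sided_p_value(x1: float) -> float:
    """p-value 1 - Phi(x1) for the null hypothesis mu_1 = 0."""
    return 1.0 - std_normal_cdf(x1)

def posterior_null(x1: float, n1: int, total: int = 10**6, mu_alt: float = 2.0) -> float:
    """Posterior probability of mu_1 = 0 when n1 of `total` population means equal mu_alt."""
    prior_alt = n1 / total
    null_term = (1.0 - prior_alt) * std_normal_pdf(x1)
    alt_term = prior_alt * std_normal_pdf(x1 - mu_alt)
    return null_term / (null_term + alt_term)

if __name__ == "__main__":
    x1 = 2.0
    print(f"p-value: {one_sided_p_value(x1):.4f}")
    for n1 in (1, 100):
        print(f"N1 = {n1}: posterior probability of the null = {posterior_null(x1, n1):.8f}")
```

The prior enters only through the ratio $N_1/10^6$, which is why any guess of $N_1$ between 1 and 100 leaves the posterior probability of the null hypothesis close to 1.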



Lindley (1957) thoroughly examined a similar "paradox" from a more Bayesian viewpoint. 



The type of problem faced in Example 1 will be attacked by adapting methodology
recently developed for gene expression microarray data to two other settings: (1) those with 
data available for testing only a much smaller number of hypotheses and (2) those with much 
smaller proportions of null hypotheses that are true. 

Microarray technology enables the measurement of levels of gene expression for thousands 
of genes in cells under two different conditions, conveniently labeled as treatment and control. 
Which genes have differential expression in the mean between the treatment and control
populations? That large-scale problem of multiple comparisons led Efron et al. (2001) to
apply the false discovery rate (FDR) of Benjamini and Hochberg (1995) and to introduce
the local false discovery rate (LFDR). In accordance with its name, the LFDR is a rate of
Type I errors that would be incurred were the null hypothesis rejected every time the same
data are generated as those actually observed. In the microarray context, the LFDR is an
empirical Bayes posterior probability of the null hypothesis that a particular gene does not
have differential expression, as in equation (1). More precisely, the LFDR is defined as the
prior probability of the null hypothesis conditional on the p-value or other statistic that
reduces the measured expression levels of the gene to a single number (Efron, 2010b).

Here, as in Example 1, the prior probability approximates an unknown proportion of
null hypotheses that are true, with each null hypothesis corresponding to a different gene. In 
that sense, the LFDR differs from a fully Bayesian posterior probability, which requires the 
complete specification of the prior distribution of all unknown parameters. Such specification 
usually involves prior probabilities that correspond to hypothetical levels of belief rather than 
real relative frequencies or proportions. Thus, whereas a purely Bayesian prior is necessarily 
known in principle, empirical Bayes priors are unknown. 

Since the LFDR generally depends on parameters that do not have a known prior dis- 
tribution, the LFDR can only be estimated. Supposing, however, that the LFDR could be 
known and neglecting any information lost in reducing the data to a test statistic for each 
hypothesis, Bayes decision rules based on the LFDR would have optimal Bayes risk. That 
is, they would perform at least as well on average as any other decision rule with respect to 
any bounded loss function. Knowledge of the LFDR would require knowledge not only of 
the proportion of null hypotheses that are true but also the distribution of the reduced data 
under the alternative hypotheses. In that case, there would be no objection against relying
on the LFDR derived from Bayes's theorem since frequentists by principle condition on the
data in the presence of a known population of parameter values (Fisher, 1973; Wilkinson,
1977; Edwards, 1992; Kyburg and Teng, 2006; Hald, 2007). With that knowledge, the
unquestioned applicability of the LFDR would hold regardless of the number of hypotheses
that correspond to measurements. As a result, the LFDR would apply to a single comparison
corresponding to a hypothesis randomly drawn from the population (Example 1) no less than
to multiple comparisons spanning the entire population of hypotheses.

However, it is generally believed that the LFDR can only be adequately estimated if 
there are data directly related to thousands of hypotheses. For example, if data are only 
available for 20 genes, or, in the case study of this paper, 20 proteins, then the LFDR is not 
considered applicable. Indeed, empirical Bayes methods designed for several thousands of 
comparisons do not necessarily work as well with smaller numbers of hypotheses. 

In some respects, that limitation of the empirical Bayes framework restricts the utility
of multiple comparison procedures more generally. The discussions of two empirical Bayes
papers spanning the last three decades (Morris, 1983a; Efron, 2010a) illustrate the consensus
that very different procedures seem suitable for different numbers of comparisons. Westfall
(2010) emphasized in his comment that whereas methods that control family-wise error
rates (FWERs) have insufficient statistical power for very large numbers of comparisons,
estimators of FDRs and LFDRs become unreliable for small numbers of comparisons. Efron
(2010c) replied with a recommendation for FWER control for smaller numbers of comparisons
as a substitute for empirical Bayes estimation of the FDR for larger numbers of comparisons.
That conflicts with the viewpoint of Morris (1983b), another pioneer of empirical Bayes
procedures, who resorted to fully Bayesian procedures for small numbers of comparisons.

The main purpose of this paper is to extend the scope of LFDR estimation to the smallest
possible scale: that of a single comparison. The investigation will involve modifying a
successful method of LFDR estimation and studying its relative performance in various
contexts. It will be compared to fully Bayesian inference under a default prior and to the
p-value interpreted inferentially with the aid of confidence distributions. The importance
of the p-value in the multiple comparison framework lies in the fact that it is equal to the
p-value adjusted to control an error rate when only one comparison is made. For example,
with data for only a single hypothesis test, the achieved FDR, the lowest value at which the
FDR has guaranteed control, is equal to the p-value (Benjamini and Hochberg, 1995).



Were such a method of small-scale LFDR estimation available for small-scale genetic
association studies, the widespread publication of significant findings that could not be
replicated (Morgenthaler and Thilly, 2007) might have been avoided. The reason is that LFDR
estimation takes advantage of an estimate of the proportion of null hypotheses that are true,
which is crucial for extremely small proportions, whereas p-values ignore that information,
thereby inflating the Type I error rate of testing a hypothesis picked at random.

Example 2. For testing hundreds of thousands of genetic variants for association with
disease, FWER control in the tradition of Bonferroni, Šidák (1967), and Holm (1979) often
results in the rejection of few or no null hypotheses because of the large number of tests. The
alarming number of false positives found in candidate gene studies (Morgenthaler and Thilly,
2007) at first seems to support such adjustments of p-values for the number of tests in order
to control an FWER. However, the analogous history of false positives in candidate-gene
studies (Ioannidis et al., 2001), in which much smaller numbers of tests were performed in
each study, shows that the number of tests is not the source of the high false-positive rate.
Rather, the root of the problem lies more in the small number of disease-associated variants
compared to the total number of variants, irrespective of how many happen to be measured.
Thus, many join the Wellcome Trust Case Control Consortium (2007) in questioning "the
view that one should correct significance levels for the number of tests performed to obtain
'genome-wide significance levels.'" In place of the number of tests performed, the Wellcome
Trust Case Control Consortium (2007) uses the proportion of variants that are associated
with disease as the prior probability of association, an approach that applies in principle
even to data representing only a single variant. That proportion is thought to be between
$10^{-6}$ and $10^{-4}$, as in Example 1.

Section 2 introduces a parametric method that enables empirical Bayes inference even
in the absence of multiple comparisons. Next, Section 3 derives rival posterior distributions
from confidence intervals under fixed-parameter models. An application to proteomics data
illustrates the empirical Bayes and confidence methods in Section 4. Section 5 compares
the performance of the empirical Bayes and confidence methods for inference about a single
scalar parameter value that belongs to some population of parameter values. The paper
closes in Section 6 with a discussion of the resulting implications for whether empirical Bayes
or confidence strategies would be more suitable in a given context.

2 Empirical Bayes methods 

While methods of estimating the LFDR on the basis of nonparametric density estimators
clearly cannot apply to single-comparison data (Efron, 2010b), it will be seen that fully
parametric methods of LFDR estimation by maximum likelihood can do so under sufficiently
simple models. Since the empirical Bayes models that define the LFDR have random
parameters, the likelihood is not maximized over their values but rather over the values of the
hyperparameters specifying the proportion of null hypotheses that are true and the
distribution of the reduced data under the alternative hypothesis. Such parameters, if known,
would entail knowledge of the LFDR. More generally, the maximization of likelihood over
hyperparameters is called Type II maximum likelihood as opposed to the Type I maximum
likelihood of models that lack random parameters (Good, 1966).



2.1 Hierarchical sampling model 
2.1.1 Level 1 of the model 

Consider a reference set of $\mathcal{N}$ populations that includes the $N$ populations sampled. Thus,
$N$ is the number of comparisons that can be made on the basis of available data. For example,
$\mathcal{N}$ may be the number of genes in the genome, whereas $N$ is the number of genes on the
microarray that measures gene expression or is equal to 1 if the expression of only a single
gene is measured. Here, a comparison is understood as a hypothesis test or an effect-size
estimate.

Let $X_i$, an observable vector of dimension $n$, be a random variable of a distribution $P_{\theta_i, \lambda_i}$,
which depends on $\theta_i$, the parameter of interest, and on $\lambda_i$, the nuisance parameter, for all
$i \in \{1, \dots, \mathcal{N}\}$. Similarly, model $x_j$, the vector of $n$ observations, as a realization of $X_j$ for
all $j \in \{1, \dots, N\}$.

Those data are reduced as follows. A random statistic $U_i$ is a function of $X_i$, and an
observed statistic $u_j$ is a function of $x_j$, where the same function is applied to all $i \in \{1, \dots, \mathcal{N}\}$
and to all $j \in \{1, \dots, N\}$. Thus, $u_i$ is a realization of $U_i$ for all $i \in \{1, \dots, N\}$.

Supposing the distribution of $U_i$ is indexed by the reduced parameter $\delta_i$, a function
of $\theta_i$ and $\lambda_i$, its probability mass function or density function is denoted by $f(\bullet; \delta_i)$ for
each $i \in \{1, \dots, \mathcal{N}\}$. It follows that the probability mass or density of $u_i$ is $f(u_i; \delta_i)$ for
all $i \in \{1, \dots, N\}$. Without loss of generality, the $i$th null hypothesis is that $\theta_i = 0$ or,
equivalently, $\delta_i = 0$, for any $i \in \{1, \dots, \mathcal{N}\}$.



Example 3. Suppose the expression level of each of $N$ genes is measured for a total of
$n^{\text{treat}}$ cell cultures treated with a chemical and $n^{\text{control}}$ cell cultures not so treated. The
expression level of the $i$th gene is the logarithm of a measure of the abundance of mRNA in
the cells and is IID $N(\theta_i^{\text{treat}}, \lambda_i^2)$ within the treatment group and IID $N(\theta_i^{\text{control}}, \lambda_i^2)$ within
the control group, $\lambda_i$ being the common standard deviation. Then $T_i$, the equal-variance
Student $t$ test statistic, has a noncentral $t$ distribution with noncentrality parameter

$$\Delta_i = \left(\theta_i^{\text{treat}} - \theta_i^{\text{control}}\right) \left(1/n^{\text{treat}} + 1/n^{\text{control}}\right)^{-1/2} / \lambda_i$$

and $n - 2 = n^{\text{treat}} + n^{\text{control}} - 2$ degrees of freedom; this is abbreviated by
$T_i \sim \text{Student}(\Delta_i, n - 2)$. Then $U_i = |T_i|$ is very effective for
inference about $\delta_i = |\Delta_i|$. By implication, $U_i$ is highly informative about the expression fold
change $\exp|\theta_i^{\text{treat}} - \theta_i^{\text{control}}|$, the effect size most often estimated in reports of microarray data
analysis, and about whether $\theta_i^{\text{treat}} = \theta_i^{\text{control}}$ since that is necessary and sufficient for $\delta_i = 0$.
If $n^{\text{treat}} + n^{\text{control}}$ is large enough, then $T_i$ is approximately $N(\Delta_i, 1)$, which entails that $U_i^2$ is
approximately distributed as $\chi^2(\delta_i^2, 1)$, the noncentral chi-square distribution with
noncentrality parameter $\delta_i^2$ and 1 degree of freedom.
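
The data reduction of Example 3 can be illustrated as follows. This is a minimal sketch with hypothetical function and argument names; it assumes that the log expression levels of one gene are supplied as two lists with at least two values each, and it uses only the Python standard library.

```python
import math
from statistics import mean, variance

def equal_variance_t(treat, control):
    """Return (T, U, df): the pooled-variance t statistic, U = |T|, and n - 2."""
    n_treat, n_control = len(treat), len(control)
    df = n_treat + n_control - 2
    # Pooled estimate of the common variance of the two groups.
    pooled_var = ((n_treat - 1) * variance(treat)
                  + (n_control - 1) * variance(control)) / df
    # Estimated standard error of the difference between the sample means.
    std_error = math.sqrt(pooled_var * (1.0 / n_treat + 1.0 / n_control))
    t = (mean(treat) - mean(control)) / std_error
    return t, abs(t), df

if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    t, u, df = equal_variance_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
    print(t, u, df)
```

Only the reduced statistic `u` and the degrees of freedom are carried forward to the second level of the model.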

The most common model for analyzing genetic association data has the same asymptotics. 

Example 4. Example 2, continued. In order to utilize genetic models such as the additive
model (Lewis, 2002) and in order to account for effects of covariates, genetic association data
are typically analyzed using the Wald approximation with logistic regression, yielding the
statistic $T_i$ equal to the (Type I) maximum likelihood estimate of the log odds ratio divided
by the estimated standard error of that estimate for variant $i$ of $N$. The statistic $U_i = |T_i|$
is highly informative about the absolute value of the log odds ratio and whether it is equal
to 0, as under the null hypothesis of no association between the genotype and the trait. For
a sufficiently high number of case and control subjects, $U_i^2$ is approximately distributed as
$\chi^2(\delta_i^2, 1)$, as in Example 3.

2.1.2 Level 2 of the model

The first level of the hierarchical model describes the variability of the expression levels
of each gene or other population that corresponds to a comparison (§2.1.1). To represent
variability between populations or comparisons, $\delta_i$ is now modeled as the random variable
equal to 0 with probability $\pi_0$, equal to some $\delta^{(1)}$ with probability $\pi_1$, equal to some
$\delta^{(2)} \notin \{0, \delta^{(1)}\}$ with probability $\pi_2$, and so on up to some
$\delta^{(K)} \notin \{0, \delta^{(1)}, \dots, \delta^{(K-1)}\}$ with probability $\pi_K$
for a $K \in \{1, 2, \dots\}$. The alternative-hypothesis parameters constitute $\psi$, a matrix with
$(\pi_1, \dots, \pi_K)$ and $(\delta^{(1)}, \dots, \delta^{(K)})$ as its two columns.

Then the unknown hyperparameters are $\pi_0$ and $\psi$, and the probability mass function or
density function of $U_i$ is the finite mixture

$$f(\bullet; \pi_0, \psi) = \pi_0 f(\bullet; 0) + \sum_{k=1}^{K} \pi_k f(\bullet; \delta^{(k)})$$

for all $i \in \{1, \dots, \mathcal{N}\}$. The random indicator $\nu_i$ will determine whether the null hypothesis
is true ($\nu_i = 1$) or false ($\nu_i = 0$) for all $i \in \{1, \dots, \mathcal{N}\}$. It is assumed that $\mathcal{N}$ is large enough
that $P(\nu_i = 1) = \pi_0$ is approximately $\sum_{i=1}^{\mathcal{N}} \nu_i / \mathcal{N}$, the proportion of null hypotheses that are
true.

The local false discovery rate, $P(\nu_i = 1 \mid U_i = u_i)$ by definition, is

$$\text{LFDR}(u_i; \pi_0, \psi) = \frac{\pi_0 f(u_i; 0)}{f(u_i; \pi_0, \psi)}$$

by Bayes's theorem. As this LFDR is unknown only because $\pi_0$ and $\psi$ are unknown, it may
be estimated by Type II maximum likelihood, as will now be seen.
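
For concreteness, the LFDR under known hyperparameters can be evaluated as in the sketch below. It is an illustration under an assumption not made in this subsection: the reduced statistic is taken to be $U_i = |T_i|$ with $T_i \sim N(\delta_i, 1)$, the large-sample setting of Example 3, so that $f$ is a folded normal density. All function names are hypothetical.

```python
import math

def folded_normal_pdf(u: float, delta: float) -> float:
    """Density of U = |T| at u >= 0 when T ~ N(delta, 1) (assumed form of f)."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return c * (math.exp(-0.5 * (u - delta) ** 2) + math.exp(-0.5 * (u + delta) ** 2))

def lfdr(u: float, pi0: float, alt_probs, alt_deltas) -> float:
    """LFDR(u; pi0, psi) = pi0 f(u; 0) / f(u; pi0, psi) for K alternative components."""
    if abs(pi0 + sum(alt_probs) - 1.0) > 1e-9:
        raise ValueError("pi0 and the alternative probabilities must sum to 1")
    null_part = pi0 * folded_normal_pdf(u, 0.0)
    alt_part = sum(p * folded_normal_pdf(u, d) for p, d in zip(alt_probs, alt_deltas))
    return null_part / (null_part + alt_part)

if __name__ == "__main__":
    # K = 1: 90% true nulls and a single alternative value delta = 2 (illustrative values).
    print(lfdr(2.0, 0.9, [0.1], [2.0]))
```

The two columns of $\psi$ correspond to the arguments `alt_probs` and `alt_deltas`.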



2.2 Type II maximum likelihood 



The hyperparameters are estimated by $\hat{\pi}_0$ and $\hat{\psi}$, the values of $\pi_0$ and $\psi$ at which the
likelihood

$$\prod_{i=1}^{N} f(u_i; \pi_0, \psi)$$

attains its maximum subject to the constraints that $\sum_{k=0}^{K} \pi_k = 1$ and $0 \le \pi_k \le 1$. Then
$\text{LFDR}(u_i; \hat{\pi}_0, \hat{\psi})$ is the maximum likelihood estimate of the LFDR. Pawitan et al. (2005),
Muralidharan (2010), and Yang and Bickel (2010) employed this method of estimating the
LFDR under fully parametric finite mixtures.

To prevent overfitting in the form of excessive variance in the estimates, the value of $K$
must be smaller for smaller values of $N$. For that reason, Bickel (2010d) suggested $K = 1$
when $N < 1000$. That model is simpler than those of higher values of $K$: the only free
parameters are $\pi_0$, the approximate proportion of null hypotheses that are true, and $\delta^{(1)}$,
the value of the reduced parameter indexing the alternative distribution. However, it is not
simple enough for a single comparison ($N = 1$), for in that case, $\hat{\pi}_0 = 0$ almost always.

More generally, whenever $N$ is deemed too small for reliable estimation of $\pi_0$ with $\hat{\pi}_0$
only restricted to the interval $[0, 1]$, it will be further constrained to the strictly smaller
interval $[\pi_0^-, \pi_0^+]$, a proper subset of $[0, 1]$ with the specified bounds $\pi_0^-$ and $\pi_0^+$ such that
$0 \le \pi_0^- < \pi_0^+ \le 1$. Thus, the proposed method guarantees that $\pi_0^- \le \hat{\pi}_0 \le \pi_0^+$ even for the
lowest values of $N$.

In the case of $N = 1$, there is overfitting in the sense that $\hat{\pi}_0 = \pi_0^-$ almost always.
Likewise, for small values of $N$, $\hat{\psi}$ is not an optimal estimator of $\psi$. Thus, improvements
such as those based on predictive distributions are certainly possible (e.g., Bickel, 2011).
Nonetheless, the application (§4) and simulations (§5) demonstrate that even the simple
method introduced here can perform substantially better than methods that take no account
of the hierarchical structure of the data. It will be seen that with certain distributions
of unknown parameter values, even extremely crude estimates of the hyperparameters are
preferable to no estimates at all.

To prevent problems with numerically maximizing the likelihood, the reduced parameter
$\delta^{(1)}$ was constrained under the alternative hypothesis to have a lower bound of $10^{-3}$ for
Sections 4 and 5, but none of the results was sensitive to the value of that bound.
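
The constrained estimation of this subsection can be sketched as follows for $K = 1$. The sketch rests on assumptions that are not taken from the paper: the reduced statistic is $|T|$ with $T \sim N(\delta, 1)$, and a plain grid search replaces whatever numerical optimizer was actually used. The function names, the upper limit on $\delta$, and the grid sizes are hypothetical.

```python
import math

def folded_normal_pdf(u: float, delta: float) -> float:
    """Density of U = |T| at u >= 0 when T ~ N(delta, 1) (assumed form of f)."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return c * (math.exp(-0.5 * (u - delta) ** 2) + math.exp(-0.5 * (u + delta) ** 2))

def fit_constrained(us, pi0_lower, pi0_upper, delta_min=1e-3, delta_max=10.0,
                    delta_steps=1000, pi0_steps=100):
    """Grid-search Type II maximum likelihood for K = 1 with pi0 in [pi0_lower, pi0_upper]."""
    f0 = [folded_normal_pdf(u, 0.0) for u in us]  # null densities, fixed across the grid
    best_loglik, best_pi0, best_delta = -math.inf, pi0_lower, delta_min
    for b in range(delta_steps + 1):
        delta = delta_min + (delta_max - delta_min) * b / delta_steps
        f1 = [folded_normal_pdf(u, delta) for u in us]
        for a in range(pi0_steps + 1):
            pi0 = pi0_lower + (pi0_upper - pi0_lower) * a / pi0_steps
            # Floor the mixture density to avoid log(0) from numerical underflow.
            loglik = sum(math.log(max(pi0 * x + (1.0 - pi0) * y, 1e-300))
                         for x, y in zip(f0, f1))
            if loglik > best_loglik:
                best_loglik, best_pi0, best_delta = loglik, pi0, delta
    return best_pi0, best_delta

def estimated_lfdr(u, pi0_hat, delta_hat):
    """Plug the estimates into LFDR(u) = pi0 f(u; 0) / f(u; pi0, delta)."""
    null_part = pi0_hat * folded_normal_pdf(u, 0.0)
    alt_part = (1.0 - pi0_hat) * folded_normal_pdf(u, delta_hat)
    return null_part / (null_part + alt_part)

if __name__ == "__main__":
    u_observed = [3.0]  # a single comparison, N = 1 (illustrative value)
    pi0_hat, delta_hat = fit_constrained(u_observed, pi0_lower=0.5, pi0_upper=1.0)
    print(pi0_hat, delta_hat, estimated_lfdr(u_observed[0], pi0_hat, delta_hat))
```

With a single large statistic the search places the estimate of $\pi_0$ on the lower bound of its interval, so the bound itself drives the estimated LFDR; that is the sense in which the constraint substitutes for information that multiple comparisons would otherwise supply.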

3 Confidence methods 

This section confines attention to the single-level model consisting of the model of Section
2.1.1 with fixed parameters rather than the random parameters of Section 2.1.2. The concept
of confidence posterior distributions will be reviewed to set the stage for the observed
confidence levels to consider as viable alternatives to LFDRs.

Let $\Theta \subseteq \mathbb{R}^1$ denote the parameter space of each fixed parameter value $\theta_i$ in the sense that
it is the smallest set in which $\theta_i$ is known to lie. Likewise, let $\Lambda$ denote the parameter space
of each $\lambda_i$. Whereas the nuisance parameter $\lambda_i$ may be a scalar or vector, it is assumed that
the interest parameter $\theta_i$ is a scalar, i.e., that $\Theta \subseteq \mathbb{R}^1$.

Consider the random variable $\vartheta_i$ that has probability distribution $P(\bullet; u_i)$ on $\Theta$ such
that

$$P(\vartheta_i \le \theta_i; u_i) = P_{\theta_i, \lambda_i}(U_i > u_i) \qquad (2)$$

for all $\theta_i \in \Theta$ and $\lambda_i \in \Lambda$, where $U_i$ is a scalar statistic determined by $P_{\theta_i, \lambda_i}$, the sampling
distribution of $X_i$ introduced in Section 2.1.1. The random elements of the equation are $\vartheta_i$
on the left-hand side but $U_i$ on the right-hand side.

The probability measure $P(\bullet; u_i)$ is the confidence posterior distribution of $\theta_i$. The word
confidence emphasizes the property that the interval bounded by the $\beta_1$-quantile and the
$\beta_2$-quantile of $\vartheta_i$ is a $(\beta_2 - \beta_1)100\%$ confidence interval in the sense that it has a
$(\beta_2 - \beta_1)100\%$ frequentist probability of including $\theta_i$ (Efron, 1993; Schweder and Hjort,
2002; Singh et al., 2005). While the term posterior correctly indicates the dependence of the
parameter distribution $P(\bullet; u_i)$ on the observed statistic $u_i$ (Bickel, 2010b,a), it is not
necessarily a Bayesian posterior, a conditional prior distribution given $U_i = u_i$. For example,
$P(\theta^- < \vartheta_i < \theta^+; u_i)$ is the confidence posterior probability of the hypothesis that the
parameter of interest lies between the fixed values $\theta^-$ and $\theta^+$ and yet need not correspond
to any Bayesian posterior probability of the hypothesis. Polansky (2007) calls
$P(\theta^- < \vartheta_i < \theta^+; u_i)$ the observed confidence level of the hypothesis; cf. Efron and
Tibshirani (1998).
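
As an illustration of the confidence posterior defined above, the sketch below evaluates observed confidence levels in the simplest case. It assumes, beyond what this section states, that the statistic is $U \sim N(\theta, 1)$ with no nuisance parameter, in which case the confidence posterior of $\theta$ given $u$ is $N(u, 1)$. The function names are hypothetical.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def confidence_posterior_cdf(theta: float, u: float) -> float:
    """P(vartheta <= theta; u) = P_theta(U > u) when U ~ N(theta, 1)."""
    return 1.0 - std_normal_cdf(u - theta)

def observed_confidence_level(theta_lo: float, theta_hi: float, u: float) -> float:
    """Confidence posterior probability that theta lies between theta_lo and theta_hi."""
    return confidence_posterior_cdf(theta_hi, u) - confidence_posterior_cdf(theta_lo, u)

if __name__ == "__main__":
    u = 1.3  # illustrative observed statistic
    # The usual 95% interval u +/- 1.96 receives an observed confidence level near 0.95.
    print(observed_confidence_level(u - 1.959964, u + 1.959964, u))
```

Any interval whose endpoints are quantiles of this distribution is a confidence interval with level equal to the difference of the quantile levels, which is the property that the word confidence refers to.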



Example 5. Example 3, continued. For simplicity, the statistic is changed to $U_i = T_i$, which
is useful for inference about the value of $\theta_i = \theta_i^{\text{treat}} - \theta_i^{\text{control}}$. Since $T_i \sim \text{Student}(\Delta_i, n - 2)$,
equation (2) implies that $\vartheta_i / \hat{\sigma}_i \sim \text{Student}(t_i, n - 2)$, where $\hat{\sigma}_i$ is the typical pooled estimate
of the standard error of the sample mean difference between treatment and control (Schweder
and Hjort, 2002). Thus, the confidence posterior distribution of the parameter of interest is
equivalent to the Bayesian posterior distribution resulting from the improper priors according
to which the mean and the logarithm of the standard deviation are uniform on the real line.
Coherence in the Bayesian sense would then require that the same posterior distribution be
used for inference about $|\theta_i|$, e.g.,

$$P(|\vartheta_i| = 0; t_i) = P(\vartheta_i = 0; t_i) = P(\vartheta_i \le 0; t_i) - \lim_{\epsilon \to 0+} P(\vartheta_i \le 0 - \epsilon; t_i)$$
$$= P_{0, \lambda_i}(U_i > u_i) - \lim_{\epsilon \to 0+} P_{0 - \epsilon, \lambda_i}(U_i > u_i) = 0. \qquad (3)$$



For $\lambda_i = 1$ and large $n$, $\vartheta_i^2 \sim \chi^2(t_i^2, 1)$, which Stein (1959) presented as the fiducial
distribution for inference about $\theta_i^2$, contrasting its interval estimates with confidence intervals.

The next example extracts a different confidence posterior distribution from the same 
statistical model. 

Example 6. Example 3, continued. Let $U_i = |T_i|$ to draw inferences about $\theta_i = |\theta_i^{\text{treat}} - \theta_i^{\text{control}}|$.
By equation (2), $P(\bullet; u_i)$, the confidence posterior distribution of $\vartheta_i$, is defined by

$$P(\vartheta_i \le \theta_i; u_i) = P_{\theta_i, \lambda_i}(|T_i| > u_i).$$

Because $T_i \sim \text{Student}(0, n - 2)$ under the null hypothesis that $\theta_i = 0$, the confidence
posterior probability that the null hypothesis is true is equal to the usual two-sided p-value:

$$P(\vartheta_i = 0; u_i) = P(\vartheta_i \le 0; u_i) = P_{0, \lambda_i}(|T_i| > u_i). \qquad (4)$$

This is a clear counterexample to the observation of Polansky (2007) and Bickel (2010b) that
many confidence posteriors, e.g., that of Example 5, put no probability mass on any simple
hypothesis.
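
Equation (4) can be evaluated directly in the large-sample case. The sketch below assumes infinite degrees of freedom, so that $T_i \sim N(0, 1)$ under the null hypothesis; the function name is hypothetical.

```python
import math

def observed_confidence_of_null(u: float) -> float:
    """Two-sided p-value P_0(|T| > u) for T ~ N(0, 1), read as P(vartheta = 0; u)."""
    # erfc(u / sqrt(2)) equals 2 * (1 - Phi(u)) for u >= 0.
    return math.erfc(abs(u) / math.sqrt(2.0))

if __name__ == "__main__":
    for u in (0.5, 1.959964, 3.0):  # illustrative observed statistics
        print(u, observed_confidence_of_null(u))
```

Unlike the confidence posterior of Example 5, this distribution places positive mass on the point null hypothesis, so it can be compared directly with an estimated LFDR.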

Like the Bayesian posterior, the confidence posterior can be used to make coherent
decisions given a loss function (Bickel, 2010b,a). In the metaphor of an intelligent agent, whereas
the Bayesian posterior describes the decisions made by an agent committed to a particular
prior distribution, the confidence posterior describes the decisions made by an agent that
interprets confidence levels from a particular procedure as levels of certainty (Bickel, 2009).
Thus, the confidence posterior enables direct performance comparisons between frequentist
procedures and Bayesian and empirical Bayes posteriors, as will be seen in Sections 4 and 5.

4 Application to proteomics data 

Alex Miron's lab at the Dana-Farber Cancer Institute recorded the abundance level of each of
20 plasma proteins for every woman of two breast-cancer groups (55 HER2-positive women
and 35 mostly-ER/PR-positive women) and of a control group (64 healthy women) (Li,
2009). After adding the 25th percentile of the abundance levels within the control group to
all abundance levels in order to ensure that the adjusted levels were positive (Bickel, 2010d),
the logarithms of the adjusted levels of a given protein were modeled as quantities drawn from
a normal distribution with the same variance.

In comparing each breast-cancer group to the control group, the data for each protein
were reduced to the absolute value of the equal-variance t-statistic, which has a Student
t distribution under the null hypothesis of no difference between groups and a noncentral
Student t distribution with noncentrality parameter $\delta$ under the alternative hypothesis of a
nonzero mean difference, as in Example 3.

In order to analyze the data of all proteins simultaneously, it was assumed that the
reduced data of all proteins with differential abundance levels are absolute values of variates
drawn from the same noncentral t distribution, the noncentrality parameter of which is
denoted by $\delta$. The assumption enabled computing $\hat{\pi}_0$ and $\hat{\delta}$, the maximum likelihood
estimates of $\pi_0$ and $\delta$, using the empirical Bayes method of Section 2.2 with the constraint that
$0\% \le \pi_0 \le 100\%$. For comparison, the data of each protein were then analyzed individually
by using the confidence and empirical Bayes methods as if it were the only protein with
measured expression.

The results are summarized in Figures 1 and 2. Within each figure, the posterior
probability estimates of the top-left plot are the LFDRs estimated by substituting $\hat{\pi}_0$ and $\hat{\delta}$ for
$\pi_0$ and $\delta$, with the vertical line specifying the value of $\hat{\pi}_0$. Each posterior probability of each
top-right plot is the observed confidence level of the null hypothesis of equivalent abundance
between cancer and control groups as recorded by equation (4). The bottom two plots of
each figure report the LFDRs estimated separately for each protein by maximizing the
likelihood with the constraints that $\pi_0 \ge 50\%$ (bottom-left plot) and $\pi_0 \ge 90\%$ (bottom-right
plot), with the vertical lines drawn at 50% and 90%, respectively.

Since only the top-left plot of each figure represents the simultaneous use of the data
for all proteins, it serves as the reference for evaluating the three methods of analyzing the
data of each protein in isolation from the other data. As seen in Figure 1, the observed
confidence levels closely match the simultaneously estimated LFDRs for the HER2-control
group. By contrast, the individual-protein LFDR estimates come much closer than the
observed confidence levels to the simultaneously estimated LFDRs for the ER/PR-control
group (Figure 2). The explanation for that difference between comparisons is that the
estimated proportion of equivalent-abundance proteins is low for the first group ($\hat{\pi}_0 = 22\%$)
but high for the second group ($\hat{\pi}_0 = 89\%$).

Figure 1: Empirical distribution functions of the posterior probability that a given protein
has equivalent abundance between the HER2-positive and control groups. The four methods
compared are described in the text.

Figure 2: Empirical distribution functions of the posterior probability that a given protein
has equivalent abundance between the ER/PR-positive and control groups. Each plot
corresponds to a method described in the text.

5 Simulation studies



The simulation studies of the following subsections were carried out in the scenario of
$T_i \sim N(\Delta_i, 1)$, $U_i = |T_i|$, and $\delta_i = |\Delta_i|$ since it represents the asymptotics of a wide variety of
situations encountered in practice, including those of protein abundance (§4), gene expression
(Example 3), and genetic association (Example 4). Specifically, the test statistics were the
absolute values of the realizations drawn from the normal distribution with mean $\delta = 0$ and
variance 1 under the null hypothesis and from the normal distribution with mean $\delta \in \{2, 4\}$
and variance 1 under the alternative hypothesis. The mean error in estimating the truth
of the null hypothesis (§5.1) or the rate at which interval estimates cover $\delta$ (§5.2) then
approximated the expected error or coverage probability of each single-comparison method
under the null and alternative hypotheses.

Such approximations enabled approximating the expected error and coverage probability
for any proportion $\pi_1$ of null hypotheses that are false as the weighted average of the expected
error or coverage probability with weight $1 - \pi_1$ for the null hypothesis and $\pi_1$ for the
alternative hypothesis. This quantifies the average performance of applying each
single-comparison method to data drawn from a randomly selected hypothesis.

5.1 Hypothesis testing 

The posterior probability that a method attributes to the null hypothesis is its estimate of the
value of the indicator $\nu_i$ that equals 1 if the null hypothesis is true or 0 if not (§2.1.2). Each
method's estimation performance is here defined in terms of the mean squared error (expected
quadratic loss) for two reasons. First, admissibility under quadratic loss is necessary and
sufficient for certain desirable properties relevant to conditional inference (Robinson, 1979).
Second, quadratic loss is the only proper scoring rule for probabilities that (a) depends
only on the difference between the estimator and estimand and (b) remains unchanged if
the estimator and estimand trade places (Savage, 1971). The square root of the expected
quadratic loss is easily interpreted as an average estimation error.

The present adoption of the confidence posterior probability of equation ^ is equivalent
to interpreting the p-value as an estimate of the indicator of whether the null hypothesis is
true. The p-value used this way does not require a significance threshold and can dominate
estimates defined to equal 0 if the p-value is below such a threshold and 1 otherwise
(Hwang et al., 1992). Fixed-probability tails will be more appropriate for constructing the
confidence intervals of Section 5.2 since that section, unlike the present one, in effect imposes
a 0-1 loss function (Robinson, 1979).

On the basis of 100 realizations of the statistic drawn from each of the three normal
distributions N(0, 1), N(2, 1), and N(4, 1), Figures 3 and 4 compare the mean quadratic loss
of several methods of hypothesis testing in the general form of assigning posterior
probability to the null hypothesis. The vertical lines are drawn at $\pi_1$ = 50%. The 0% posterior
probability represents any method that necessarily assigns no probability mass to the simple
null hypothesis, including improper-prior Bayesian updating and all other methods yielding
posterior density functions (Example 5). The observed confidence level is the confidence
posterior probability given by equation ^ with infinite degrees of freedom. Each of the four
methods of estimating the LFDR imposes a different constraint on $\pi_0$ when maximizing the
likelihood.
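Two of the compared methods are simple enough to sketch directly: the two-sided p-value used as an estimate of the indicator, and the 0% posterior probability. The following is a minimal Monte Carlo sketch, not the code behind the figures. The function names, the seed, and the default number of draws (larger than the 100 realizations reported above) are assumptions made for illustration, and the constrained-likelihood LFDR estimators are not reproduced.

```python
import math
import random


def two_sided_p(x):
    """Two-sided p-value of x under N(0, 1), i.e. 2 * (1 - Phi(|x|))."""
    return math.erfc(abs(x) / math.sqrt(2.0))


def zero_posterior(x):
    """A method that never assigns probability mass to the null hypothesis."""
    return 0.0


def mean_sq_error(estimator, delta, null_true, n_draws, rng):
    """Monte Carlo mean quadratic loss for estimating the null indicator."""
    target = 1.0 if null_true else 0.0  # indicator: 1 if null true, else 0
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(delta, 1.0)  # unit-variance normal statistic
        total += (estimator(x) - target) ** 2
    return total / n_draws


def rmse_curve(estimator, delta_alt, pi1_grid, n_draws=20000, seed=1):
    """Root mean quadratic loss as a function of pi1 (proportion of false nulls)."""
    rng = random.Random(seed)
    mse_null = mean_sq_error(estimator, 0.0, True, n_draws, rng)
    mse_alt = mean_sq_error(estimator, delta_alt, False, n_draws, rng)
    return [math.sqrt((1.0 - pi1) * mse_null + pi1 * mse_alt)
            for pi1 in pi1_grid]


grid = [0.1, 0.5, 0.9]
print("p-value     :", rmse_curve(two_sided_p, 2.0, grid))
print("0% posterior:", rmse_curve(zero_posterior, 2.0, grid))
```

Under a true null hypothesis the p-value is uniformly distributed, so its mean quadratic loss there is 1/3, whereas the 0% posterior always incurs a loss of 1; the opposite ordering holds under a false null hypothesis, where the 0% posterior incurs no loss.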

Figure 3: Square root of the mean quadratic error of the (estimated) posterior probabilities of
null hypothesis truth versus $\pi_1 = 1 - \pi_0$. Reduced data were simulated from the unit-variance
normal distributions of means 0 (true null hypothesis) and 2 (false null hypothesis).

Figure 4: Square root of the mean quadratic error of the (estimated) posterior probabilities of
null hypothesis truth versus $\pi_1 = 1 - \pi_0$. Reduced data were simulated from the unit-variance
normal distributions of means 0 (true null hypothesis) and 4 (false null hypothesis).

5.2 Effect-size estimation

An interval estimate of the effect size $|\delta|$ is the interval between two quantiles of a posterior
distribution of $|\delta|$, whether a confidence posterior, a Bayesian posterior, or an empirical
Bayes posterior. For example, the central or equal-tail $(1 - \alpha)$100% confidence interval
corresponding to a confidence posterior is the interval between its $\alpha/2$ and $1 - \alpha/2$ quantiles.
The coverage rate of an interval estimate is its probability of including the true value of the
interest parameter, $|\delta|$ in the case of the simulation studies.

Figure 5 displays the coverage rates of the equal-tail 95% interval estimates obtained by simulating
800 observed test statistics from the null distribution and another 800 from the alternative
distribution with $\delta$ = 2. The displayed coverage rates are visually indistinguishable from
those obtained using 800 draws from the $\delta$ = 4 distribution instead.

The six posterior distributions of Figure 5 are those of Section 5.1, again with the vertical
line at $\pi_1$ = 50%. The improper Bayesian posterior induced by the uniform prior distribution
of $\delta$ represents the class of 0%-posterior methods (Example 5). Its interval estimates were
criticized by Stein (1959) and Wilkinson (1977) in favor of the confidence intervals of Figure
5. Its assignment of 0% posterior probability to the null hypothesis is evident from equation



6 Discussion and conclusions 



The proposed method of constraining $\pi_0$ requires no more subjective input than the popular
methods of estimating the LFDR that rely on nonparametric density estimation: they depend
on the assumption that $\pi_0$ be greater than about 90% (Efron, 2004). With sufficiently high
choices of $\pi_0$, all such methods tend to be conservative.

Figure 5: Proportion of 95% interval estimates that include the true value of the mean versus
$\pi_1 = 1 - \pi_0$. Reduced data were simulated from the unit-variance normal distributions of
means 0 (true null hypothesis) and 2 (false null hypothesis). The 50%-100% and 90%-100%
curves coincide.

The objection may be raised that all such choices are unnecessary given the guaranteed
coverage rates of fixed-parameter confidence intervals. Indeed, although Bayesian and
empirical Bayes methods can cover the true parameter at slightly higher rates, they can also
have much worse coverage than confidence intervals. For example, empirical Bayes intervals
based on LFDR estimation have poor coverage at high values of $\pi_1$ (Figure 5).

However, the main advantage of LFDR-based interval estimates over fixed-parameter
confidence intervals lies not in the potential increase in the coverage rate but rather in the
striking reduction in their width (Ghosh 2009; Efron 2010b; Bickel 2010c). That is
especially true for lower values of $\pi_1$, as can be seen from the increasing concentration
of posterior probability mass at the null hypothesis as $\pi_1 \to 0$ (Figures 3 and 4).
Whenever the posterior probability of the null hypothesis is at least 97.5%, which happens with
close to 100% frequency for high values of the lower bound $\pi_0^-$, the 95% interval estimate is
[0, 0]. That interval has zero width and yet will cover the true value at a rate of $1 - \pi_1$, the
proportion of null hypotheses ($\delta_i = 0$) that are true.
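The collapse of the interval to [0, 0] can be illustrated with a small sketch. The sketch assumes a posterior for $|\delta|$ with a point mass `p0` at 0 (the posterior probability of the null hypothesis) and a continuous part. The choice of continuous part, the distribution of $|\delta|$ when $\delta$ ~ N(x, 1), is an assumption made only for illustration, and all function names are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def abs_cdf(t, x):
    """P(|delta| <= t) when delta ~ N(x, 1); the assumed continuous part."""
    if t <= 0.0:
        return 0.0
    return norm_cdf(t - x) - norm_cdf(-t - x)


def posterior_quantile(q, p0, x):
    """q-quantile of |delta| under point mass p0 at 0 plus a continuous part."""
    if q <= p0:
        return 0.0  # the point mass at 0 already accounts for probability q
    target = (q - p0) / (1.0 - p0)  # quantile level within the continuous part
    lo, hi = 0.0, abs(x) + 10.0
    for _ in range(200):  # bisection on the monotone function abs_cdf
        mid = 0.5 * (lo + hi)
        if abs_cdf(mid, x) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equal_tail_interval(p0, x, alpha=0.05):
    """Equal-tail (1 - alpha) interval estimate of |delta|."""
    return (posterior_quantile(alpha / 2.0, p0, x),
            posterior_quantile(1.0 - alpha / 2.0, p0, x))


print(equal_tail_interval(0.98, 3.0))  # both quantiles fall on the point mass
print(equal_tail_interval(0.50, 3.0))  # lower limit 0, positive upper limit
```

Once `p0` exceeds 97.5%, both the 2.5% and 97.5% quantiles sit on the point mass, so the interval is [0, 0] whatever the continuous part looks like; such an interval covers the true value exactly when the null hypothesis is true.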

The value of $\pi_1$ also determines whether the LFDR approach performs better or worse
than the confidence approach in the context of inferring whether or not a null hypothesis is
true. For $\pi_1$ < 10%, there is substantial improvement in inference even when the lower bound
$\pi_0^-$ is far from $1 - \pi_1$ (Figures 3 and 4).



Among others, Lindley (1957) and Berger and Sellke (1987) contrasted Bayesian posterior
probabilities of simple null hypotheses with p-values before the LFDR was defined. The
results of Berger and Sellke (1987) hold without their reliance on the misinterpretation of
the p-value as a Bayesian posterior probability since, in confidence-posterior decision theory
(Bickel, 2010b,a), the two-sided p-value can be a legitimate confidence posterior probability
Q. Berger and Sellke (1987) found that the p-value can be far from the actual error rate,
which necessarily depends on $\pi_1$, the proportion of null hypotheses that are false, whether
or not that proportion is known. That, however, is insufficient for concluding that Bayesian
testing is superior: in low-information situations, Bayesian posterior probabilities will also
be far from those that would be computed with knowledge of $\pi_1$ and other model parameters.
For the practical scientist who wants to know not about error rates but whether
or not the null hypothesis is true, the more important criterion is whether Bayesian posterior
probabilities or p-values come closer to $\nu_i$, the indicator of the truth of the ith null hypothesis.

Using that criterion actually favored the p-value as an observed confidence level over
the empirical Bayes methods for $\pi_1$ > 50% (Figures 1, 3, and 4). That largely vindicates the
use of confidence-based methods when all that is known about the parameter of interest is
encoded either in the model or in the test appropriate for a plausible null hypothesis (§3).

Nonetheless, even with the vague information that the hypothesis tested belongs to a
relevant class in which most null hypotheses are true, rough guesses at $\pi_0$ can bring notable
improvements in inference accuracy. An extreme case is that of genetic association studies
(Example 1), for which $\pi_1^- = 10^{-6}$ and $\pi_1^+ = 10^{-4}$ are reasonable lower and upper bounds
of the proportion of SNPs associated with a given disease (Wellcome Trust Case Control
Consortium, 2007).



The need to consider $\pi_1$ when making statistical inferences cannot be avoided by
running algorithms that automatically control the FDR or FWER. The fundamental difference
between the LFDR and the FDR is exposed at lower numbers of comparisons and especially
at the single-comparison scale. Since FDR control reduces to standard hypothesis testing
when there is only a single test (Benjamini and Hochberg 1995), the achieved FDR, like
any achieved FWER, is the unadjusted p-value and thus is suitable in the same high-$\pi_1$
situations.